import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%matplotlib inline
warnings.simplefilter(action='ignore', category=FutureWarning)
train_data = pd.read_csv('C:\\Users\\Owner\\Desktop\\Proga\\REPOS\\mentorship_EPAM\\data\\train.csv', index_col = "id")
train_data.head()
| date | store_nbr | family | sales | onpromotion | |
|---|---|---|---|---|---|
| id | |||||
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |
train_data.tail()
| date | store_nbr | family | sales | onpromotion | |
|---|---|---|---|---|---|
| id | |||||
| 3000883 | 2017-08-15 | 9 | POULTRY | 438.133 | 0 |
| 3000884 | 2017-08-15 | 9 | PREPARED FOODS | 154.553 | 1 |
| 3000885 | 2017-08-15 | 9 | PRODUCE | 2419.729 | 148 |
| 3000886 | 2017-08-15 | 9 | SCHOOL AND OFFICE SUPPLIES | 121.000 | 8 |
| 3000887 | 2017-08-15 | 9 | SEAFOOD | 16.000 | 0 |
train_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3000888 entries, 0 to 3000887 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date object 1 store_nbr int64 2 family object 3 sales float64 4 onpromotion int64 dtypes: float64(1), int64(2), object(2) memory usage: 137.4+ MB
train_data["family"] = train_data["family"].str.lower() # for easier reading
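The `info()` output above reports roughly 137 MB of memory. Since 'family' holds only a few dozen distinct strings, converting it to the pandas 'category' dtype is one hedged way to shrink that; the sketch below uses a small synthetic stand-in for `train_data`, not the real file.

```python
import pandas as pd

# Synthetic stand-in: a column with few distinct strings repeated many times,
# mimicking the 'family' column of train_data.
df = pd.DataFrame({"family": ["automotive", "beauty"] * 1000})

before = df["family"].memory_usage(deep=True)
df["family"] = df["family"].astype("category")  # strings -> small integer codes
after = df["family"].memory_usage(deep=True)

print(after < before)  # the categorical version is smaller
```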
train_data.describe(percentiles=[.25, 0.375, .5, .75, .875]).drop(["count"])
# The "count" row is dropped to avoid exponential number formatting
# The .875 percentile is added to estimate how many rows have no promotion
| store_nbr | sales | onpromotion | |
|---|---|---|---|
| mean | 27.500000 | 357.775749 | 2.602770 |
| std | 15.585787 | 1101.997721 | 12.218882 |
| min | 1.000000 | 0.000000 | 0.000000 |
| 25% | 14.000000 | 0.000000 | 0.000000 |
| 37.5% | 21.000000 | 2.000000 | 0.000000 |
| 50% | 27.500000 | 11.000000 | 0.000000 |
| 75% | 41.000000 | 195.847250 | 0.000000 |
| 87.5% | 48.000000 | 637.950875 | 2.000000 |
| max | 54.000000 | 124717.000000 | 741.000000 |
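The 87.5th percentile above only hints at how many rows have no promotion; the exact share can be computed directly. A minimal sketch on a tiny synthetic stand-in for `train_data` (the column name matches, the numbers are invented):

```python
import pandas as pd

# Synthetic stand-in for train_data: 6 of 8 rows have no promotion.
df = pd.DataFrame({"onpromotion": [0, 0, 0, 5, 0, 2, 0, 0]})

# Fraction of rows where onpromotion == 0 (mean of a boolean mask).
share_no_promo = (df["onpromotion"] == 0).mean()
print(share_no_promo)  # 0.75 on this toy data
```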
train_data["family"].describe()
count 3000888 unique 33 top prepared foods freq 90936 Name: family, dtype: object
The only categorical feature in this table, "family", has 33 unique values, which is not too many.
That means we can easily apply one-hot encoding during model training.
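As a sketch of the one-hot encoding mentioned above, `pd.get_dummies` applied to a synthetic 'family' column (values invented for illustration):

```python
import pandas as pd

# Synthetic stand-in for the 'family' column.
df = pd.DataFrame({"family": ["automotive", "beauty", "automotive", "books"]})

# One column per category, prefixed for readability.
encoded = pd.get_dummies(df, columns=["family"], prefix="family")
print(encoded.columns.tolist())
# ['family_automotive', 'family_beauty', 'family_books']
```

With 33 categories this adds 33 sparse columns, which is manageable for most models.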
family_values = train_data["family"].unique()
family_values
array(['automotive', 'baby care', 'beauty', 'beverages', 'books',
'bread/bakery', 'celebration', 'cleaning', 'dairy', 'deli', 'eggs',
'frozen foods', 'grocery i', 'grocery ii', 'hardware',
'home and kitchen i', 'home and kitchen ii', 'home appliances',
'home care', 'ladieswear', 'lawn and garden', 'lingerie',
'liquor,wine,beer', 'magazines', 'meats', 'personal care',
'pet supplies', 'players and electronics', 'poultry',
'prepared foods', 'produce', 'school and office supplies',
'seafood'], dtype=object)
plt.figure(figsize=(20, 40))
for i, value in enumerate(family_values):
    plt.subplot(len(family_values)//3, 3, i+1)
    current_value_data = train_data[train_data["family"] == value].groupby(["date"]).mean()
    plt.xlim(0, 1750)  # xlim/ylim are functions; assigning to plt.xlim has no effect
    plt.ylim(0, current_value_data["sales"].max())
    plt.scatter(x=np.arange(len(current_value_data.index)), y=current_value_data["sales"])
    plt.title(value)
plt.show()
Actually, these plots are of limited help: looking at them, we can only catch common "trends" that describe global changes in the sales of some products.
For example, sales appear to drop to 0 every 1st of January (this still needs to be verified, but it looks obvious from the plots).
We can also say that sales in general have a positive dynamic (for example, 'automotive', 'bread/bakery', 'grocery i', 'personal care', ...).
Some product types have a negative dynamic (such as 'lingerie').
Some goods have a very interesting sales distribution ('books', 'produce', 'frozen foods', 'ladieswear', ...), and we cannot yet say what causes it.
Also, some goods families show a seasonal increase in sales ('school and office supplies', 'liquor,wine,beer', 'grocery ii', 'frozen foods').
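The 1st-of-January claim above is flagged as needing a check; a minimal sketch of such a check, on synthetic stand-in data (dates and sales invented for illustration):

```python
import pandas as pd

# Synthetic stand-in for train_data with two New Year's days.
df = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2014-01-01", "2014-01-02"],
    "sales": [0.0, 120.5, 0.0, 98.3],
})
df["date"] = pd.to_datetime(df["date"])

# Select every 1st of January and check the total sales on those days.
jan1 = df[(df["date"].dt.month == 1) & (df["date"].dt.day == 1)]
print(jan1["sales"].sum())  # 0.0 would confirm the claim on this sample
```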
plt.figure(figsize=(20, 110))
for i, family_value in enumerate(family_values):
    plt.subplot(17, 2, i+1)
    average_sales = train_data[train_data['family'] == family_value].groupby('date').mean()['sales']
    trend = average_sales.rolling(
        window=365,        # one-year window
        center=True,
        min_periods=183,   # allow roughly half a year of data at the edges
    ).mean()
    ax = average_sales.plot(alpha=0.5)
    ax = trend.plot(ax=ax, linewidth=3)
    plt.title(family_value)
plt.show()
plt.figure(figsize=(20, 70))
for i, value in enumerate(family_values):
    plt.subplot(len(family_values)//3, 3, i+1)
    current_value_data = train_data[train_data['family'] == value].groupby(["date"]).mean()
    sns.distplot(current_value_data['sales'], color='g', bins=100, hist_kws={'alpha': 0.4})
    plt.title(value)
plt.show()
These plots show that, roughly speaking, the sales distributions of all families fall into two groups:
1) Normal or close to normal (such as 'automotive', 'bread/bakery', 'cleaning', 'eggs', 'grocery', 'lingerie', ...). Interestingly, most distributions in this group are right-skewed (the skewness coefficient is positive); the proof is below.
2) Distributions where most of the density is concentrated at or near zero, with the rest of the data distributed differently (some families, such as 'home and kitchen i' and 'ladieswear', show something close to a normal distribution there). This suggests such goods categories are not everyday essentials, which is why daily sales are often equal to 0.
for value in family_values:
    print(value, ': ', round(train_data[train_data['family'] == value].groupby(["date"]).mean().skew()['sales'], 2))
automotive : 0.75 baby care : 2.92 beauty : 1.26 beverages : 0.5 books : 5.04 bread/bakery : 0.56 celebration : 0.29 cleaning : 0.66 dairy : 0.25 deli : 0.52 eggs : 0.73 frozen foods : 6.4 grocery i : 0.89 grocery ii : 3.16 hardware : 0.6 home and kitchen i : 0.43 home and kitchen ii : 0.82 home appliances : 0.26 home care : 0.08 ladieswear : 0.69 lawn and garden : 8.01 lingerie : 1.28 liquor,wine,beer : 1.83 magazines : 1.0 meats : 6.73 personal care : 0.95 pet supplies : 0.41 players and electronics : 0.43 poultry : 0.04 prepared foods : 0.28 produce : -0.14 school and office supplies : 5.81 seafood : 0.25
sales_and_onpromotion_data = train_data.drop(['date', 'store_nbr', 'family'], axis=1)
sales_and_onpromotion_data.corr()
| sales | onpromotion | |
|---|---|---|
| sales | 1.000000 | 0.427923 |
| onpromotion | 0.427923 | 1.000000 |
plt.figure(figsize=(20, 40))
for i, value in enumerate(family_values):
    plt.subplot(len(family_values)//3, 3, i+1)
    current_value_data = train_data[train_data["family"] == value].groupby(["date"]).mean()
    plt.scatter(x='onpromotion', y='sales', data=current_value_data)
    plt.title(value)
plt.show()
It becomes clear that promotions, in general, have a noticeable positive influence on sales.
As the plots above show, most of the 'family' values (but not all of them!) support this. Interestingly, 'books' did not have any promotions during the whole observation period.
Moreover, correlation is influenced by outliers, so this coefficient may not be accurate.
Nevertheless, the 'onpromotion' feature is useful for predictions.
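Since the Pearson coefficient above may be distorted by outliers, a rank-based Spearman correlation is one robust alternative; the sketch below uses synthetic data with a deliberate outlier, not the real table.

```python
import pandas as pd

# Synthetic stand-in: a monotone sales/onpromotion relationship
# with one extreme sales outlier.
df = pd.DataFrame({
    "sales": [10, 20, 30, 40, 100000],
    "onpromotion": [0, 1, 2, 3, 4],
})

# Pearson is pulled around by the outlier; Spearman only uses ranks,
# so a perfectly monotone relation scores (close to) 1 regardless of scale.
print(df.corr(method="pearson").loc["sales", "onpromotion"])
print(df.corr(method="spearman").loc["sales", "onpromotion"])  # close to 1: monotone
```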
stores_data = pd.read_csv('C:\\Users\\Owner\\Desktop\\Proga\\REPOS\\mentorship_EPAM\\data\\stores.csv')
stores_data.head()
| store_nbr | city | state | type | cluster | |
|---|---|---|---|---|---|
| 0 | 1 | Quito | Pichincha | D | 13 |
| 1 | 2 | Quito | Pichincha | D | 13 |
| 2 | 3 | Quito | Pichincha | D | 8 |
| 3 | 4 | Quito | Pichincha | D | 9 |
| 4 | 5 | Santo Domingo | Santo Domingo de los Tsachilas | D | 4 |
stores_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 54 entries, 0 to 53 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 store_nbr 54 non-null int64 1 city 54 non-null object 2 state 54 non-null object 3 type 54 non-null object 4 cluster 54 non-null int64 dtypes: int64(2), object(3) memory usage: 2.2+ KB
The store id by itself doesn't help with sales prediction, so it should be replaced with the corresponding information about the store.
train_data = train_data.merge(stores_data, on="store_nbr", how="left")
train_data = train_data.drop('store_nbr', axis=1)
train_data.head()
| date | family | sales | onpromotion | city | state | type | cluster | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | automotive | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 1 | 2013-01-01 | baby care | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 2 | 2013-01-01 | beauty | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 3 | 2013-01-01 | beverages | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 4 | 2013-01-01 | books | 0.0 | 0 | Quito | Pichincha | D | 13 |
print('Number of \'city\', \'state\', \'type\' unique values: \n')
for feature in ['city', 'state', 'type']:
    print(feature, len(train_data[feature].unique()))
Number of 'city', 'state', 'type' unique values: city 22 state 16 type 5
plt.figure(figsize=(25,9))
sns.countplot(x='city', data=train_data, alpha=0.7)
<AxesSubplot:xlabel='city', ylabel='count'>
plt.figure(figsize=(25,9))
sns.countplot(x='state', data=train_data, alpha=0.7)
<AxesSubplot:xlabel='state', ylabel='count'>
It seems that keeping both the 'state' and 'city' features is unnecessary, so we should delete one of them.
I'll drop the 'state' feature, because 'city' gives us more information (a state can contain several cities).
'city' has only 7 more unique values than 'state', so in terms of model complexity there should not be much difference.
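The argument that 'state' is redundant rests on each city belonging to exactly one state; that assumption can be verified directly. A sketch on a synthetic stand-in for `stores_data`:

```python
import pandas as pd

# Synthetic stand-in for stores_data: each city appears in one state only.
stores = pd.DataFrame({
    "city": ["Quito", "Quito", "Guayaquil", "Manta"],
    "state": ["Pichincha", "Pichincha", "Guayas", "Manabi"],
})

# If every city maps to exactly one state, 'state' is fully
# determined by 'city' and can be dropped without losing information.
states_per_city = stores.groupby("city")["state"].nunique()
print((states_per_city == 1).all())  # True on this sample
```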
train_data = train_data.drop('state', axis=1)
train_data.head()
| date | family | sales | onpromotion | city | type | cluster | |
|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | automotive | 0.0 | 0 | Quito | D | 13 |
| 1 | 2013-01-01 | baby care | 0.0 | 0 | Quito | D | 13 |
| 2 | 2013-01-01 | beauty | 0.0 | 0 | Quito | D | 13 |
| 3 | 2013-01-01 | beverages | 0.0 | 0 | Quito | D | 13 |
| 4 | 2013-01-01 | books | 0.0 | 0 | Quito | D | 13 |
train_data_copy = train_data.copy()
train_data_copy.head()
| date | family | sales | onpromotion | city | type | cluster | |
|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | automotive | 0.0 | 0 | Quito | D | 13 |
| 1 | 2013-01-01 | baby care | 0.0 | 0 | Quito | D | 13 |
| 2 | 2013-01-01 | beauty | 0.0 | 0 | Quito | D | 13 |
| 3 | 2013-01-01 | beverages | 0.0 | 0 | Quito | D | 13 |
| 4 | 2013-01-01 | books | 0.0 | 0 | Quito | D | 13 |
plt.figure(figsize=(20, 100))
for i, family_value in enumerate(family_values):
    plt.subplot(17, 2, i+1)
    for city in train_data_copy['city'].unique():
        average_sales = train_data_copy[(train_data_copy['family'] == family_value) &
                                        (train_data_copy['city'] == city)].groupby('date').mean()['sales']
        trend = average_sales.rolling(
            window=365,
            center=True,
            min_periods=183,
        ).mean()
        ax = average_sales.plot(alpha=0)  # alpha=0 hides the raw series; only trends are drawn
        ax = trend.plot(ax=ax, linewidth=3)
    plt.title(family_value)
plt.show()
plt.figure(figsize=(7,5))
sns.countplot(x='type', data=train_data, alpha=0.7)
<AxesSubplot:xlabel='type', ylabel='count'>
plt.figure(figsize=(20, 100))
for i, family_value in enumerate(family_values):
    plt.subplot(17, 2, i+1)
    for store_type in train_data_copy['type'].unique():
        average_sales = train_data_copy[(train_data_copy['family'] == family_value) &
                                        (train_data_copy['type'] == store_type)].groupby('date').mean()['sales']
        trend = average_sales.rolling(
            window=365,
            center=True,
            min_periods=183,
        ).mean()
        ax = average_sales.plot(alpha=0)  # alpha=0 hides the raw series; only trends are drawn
        ax = trend.plot(ax=ax, linewidth=3)
    plt.title(family_value)
    plt.legend(train_data_copy['type'].unique())
plt.show()
As we can see, the store type has a great impact on the target (type D leads, which is logical: the plot above shows that type D stores are the most numerous), so the 'type' feature is very important.
plt.figure(figsize=(7,5))
sns.countplot(x='cluster', data=train_data, alpha=0.7)
<AxesSubplot:xlabel='cluster', ylabel='count'>
sales_and_cluster_data = train_data_copy.drop(['date', 'family', 'onpromotion', 'city', 'type'], axis=1)
sales_and_cluster_data.corr()
| sales | cluster | |
|---|---|---|
| sales | 1.000000 | 0.038525 |
| cluster | 0.038525 | 1.000000 |
plt.figure(figsize=(10, 7))  # assigning to plt.figsize has no effect; figsize is a figure argument
plt.scatter(x='cluster', y='sales', data=sales_and_cluster_data)
<matplotlib.collections.PathCollection at 0x25c1662d7c0>
The 'cluster' feature barely correlates with the target, and I can't see any dependencies between these two features on the scatter plot. Still, I think this feature can be useful, because it groups similar stores together.
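One way to see whether 'cluster' still carries signal despite the near-zero linear correlation is to compare average sales per cluster; a sketch on synthetic stand-in data (cluster ids and sales invented for illustration):

```python
import pandas as pd

# Synthetic stand-in: three clusters with clearly different sales levels
# but no monotone cluster-id -> sales relation (so linear correlation is weak).
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "sales": [100.0, 120.0, 10.0, 14.0, 500.0, 520.0],
})

# Group means reveal the level differences that the correlation coefficient misses.
cluster_means = df.groupby("cluster")["sales"].mean()
print(cluster_means.tolist())  # [110.0, 12.0, 510.0]
```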
oil_data = pd.read_csv('C:\\Users\\Owner\\Desktop\\Proga\\REPOS\\mentorship_EPAM\\data\\oil.csv')
oil_data["time"] = np.arange(len(oil_data.index))
oil_data.head()
| date | dcoilwtico | time | |
|---|---|---|---|
| 0 | 2013-01-01 | NaN | 0 |
| 1 | 2013-01-02 | 93.14 | 1 |
| 2 | 2013-01-03 | 92.97 | 2 |
| 3 | 2013-01-04 | 93.12 | 3 |
| 4 | 2013-01-07 | 93.20 | 4 |
oil_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1218 entries, 0 to 1217 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 1218 non-null object 1 dcoilwtico 1175 non-null float64 2 time 1218 non-null int32 dtypes: float64(1), int32(1), object(1) memory usage: 23.9+ KB
oil_data_copy = oil_data.copy()
oil_data_copy.plot(kind="scatter", x="time", y="dcoilwtico")
<AxesSubplot:xlabel='time', ylabel='dcoilwtico'>
train_data = train_data.merge(oil_data, on='date', how="left")
train_data.head()
| date | family | sales | onpromotion | city | type | cluster | dcoilwtico | time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | automotive | 0.0 | 0 | Quito | D | 13 | NaN | 0.0 |
| 1 | 2013-01-01 | baby care | 0.0 | 0 | Quito | D | 13 | NaN | 0.0 |
| 2 | 2013-01-01 | beauty | 0.0 | 0 | Quito | D | 13 | NaN | 0.0 |
| 3 | 2013-01-01 | beverages | 0.0 | 0 | Quito | D | 13 | NaN | 0.0 |
| 4 | 2013-01-01 | books | 0.0 | 0 | Quito | D | 13 | NaN | 0.0 |
train_data.columns
Index(['date', 'family', 'sales', 'onpromotion', 'city', 'type', 'cluster',
'dcoilwtico'],
dtype='object')
train_data.drop(columns=['time'])
KeyError: "['time'] not found in axis"
train_data = train_data.drop(['time'], axis=1)
# Only 'dcoilwtico' contains NaNs, so fill just that column with its mean
train_data_oil = train_data.fillna({'dcoilwtico': train_data['dcoilwtico'].mean()})
train_data_oil.head()
| date | family | sales | onpromotion | city | type | cluster | dcoilwtico | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | automotive | 0.0 | 0 | Quito | D | 13 | 67.925589 |
| 1 | 2013-01-01 | baby care | 0.0 | 0 | Quito | D | 13 | 67.925589 |
| 2 | 2013-01-01 | beauty | 0.0 | 0 | Quito | D | 13 | 67.925589 |
| 3 | 2013-01-01 | beverages | 0.0 | 0 | Quito | D | 13 | 67.925589 |
| 4 | 2013-01-01 | books | 0.0 | 0 | Quito | D | 13 | 67.925589 |
train_data.corr()
| sales | onpromotion | cluster | dcoilwtico | |
|---|---|---|---|---|
| sales | 1.000000 | 0.427923 | 3.852528e-02 | -7.905016e-02 |
| onpromotion | 0.427923 | 1.000000 | 5.666749e-03 | -1.524182e-01 |
| cluster | 0.038525 | 0.005667 | 1.000000e+00 | -3.704725e-14 |
| dcoilwtico | -0.079050 | -0.152418 | -3.704725e-14 | 1.000000e+00 |
train_data_oil.corr()
| sales | onpromotion | cluster | dcoilwtico | |
|---|---|---|---|---|
| sales | 1.000000 | 0.427923 | 3.852528e-02 | -5.777101e-02 |
| onpromotion | 0.427923 | 1.000000 | 5.666749e-03 | -1.381992e-01 |
| cluster | 0.038525 | 0.005667 | 1.000000e+00 | -2.915435e-14 |
| dcoilwtico | -0.057771 | -0.138199 | -2.915435e-14 | 1.000000e+00 |
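Mean-filling treats the oil price as if it had no time structure; since it is a daily series, linear interpolation (with a backfill for the leading NaN, as in the first row of `oil_data`) is a plausible alternative. A sketch on a synthetic stand-in for `oil_data['dcoilwtico']`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the oil price series: a leading NaN (like 2013-01-01)
# and a gap in the middle.
oil = pd.Series([np.nan, 90.0, np.nan, 92.0, 91.0])

# Linear interpolation fills interior gaps from neighbors;
# backfill handles the leading NaN that interpolation cannot reach.
filled = oil.interpolate().bfill()
print(filled.tolist())  # [90.0, 90.0, 91.0, 92.0, 91.0]
```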
holidays_events_data = pd.read_csv('C:\\Users\\Owner\\Desktop\\Proga\\REPOS\\mentorship_EPAM\\data\\holidays_events.csv')
holidays_events_data.head()
| date | type | locale | locale_name | description | transferred | |
|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |
transactions_data = pd.read_csv('C:\\Users\\Owner\\Desktop\\Proga\\REPOS\\mentorship_EPAM\\data\\transactions.csv')
transactions_data.head()
| date | store_nbr | transactions | |
|---|---|---|---|
| 0 | 2013-01-01 | 25 | 770 |
| 1 | 2013-01-02 | 1 | 2111 |
| 2 | 2013-01-02 | 2 | 2358 |
| 3 | 2013-01-02 | 3 | 3487 |
| 4 | 2013-01-02 | 4 | 1922 |